Probabilistic Comparative String Analysis
Abstract
Comparative string data has proven to be a valuable resource for improving the accuracy of computational methods for string analysis. In this report we describe the characteristics of comparative string data, focusing on biological sequences and natural language text. We then describe a general probabilistic framework for analyzing pairs of strings, show how posterior-based methods can be used to improve accuracy, and discuss ways to extend the framework to multiple sequences. We apply the posterior-based probabilistic framework to sequence alignment, which is one of the most fundamental problems in sequence analysis. While this is a well-studied problem, we show that posterior-based methods can produce biological sequence alignments that are more accurate than any of the current state-of-the-art methods. Although comparative gene finding is a more complex problem than sequence alignment, it can be modeled using a similar probabilistic model. We describe a comparative gene finding algorithm that uses posterior probabilities to integrate comparative data from multiple sequences. Finally, we discuss our plans for future work.

1 Biological and natural language comparative string data

Natural language text and biological sequences are typically represented as strings. Many string analysis algorithms and data structures, such as string matching and suffix trees, have been developed and used extensively in both domains. In this report we focus on the use of comparative data to improve the accuracy of string analysis. The main motivation for using comparative data is that while the signal-to-noise ratio might be very low in individual strings, it can be enhanced by comparing related strings that share similar signals. In this section we present a few examples of comparative string data from the biological and natural language domains. In later sections we discuss the specifics of comparative string analysis algorithms in more detail.
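The signal-to-noise argument above can be illustrated with a toy sketch. It is not the report's method: it assumes the related strings are already aligned and of equal length, and it uses a plain per-column majority vote rather than a probabilistic model. The function name `column_consensus` and the sample strings are hypothetical, chosen only for illustration.

```python
from collections import Counter


def column_consensus(aligned):
    """Majority-vote consensus over equal-length, pre-aligned strings.

    Toy illustration only: assumes the alignment is already given and
    that noise affects each string independently.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all strings must have the same length")
    consensus = []
    for column in zip(*aligned):
        # Pick the most frequent symbol in this column.
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


# Four noisy copies of a hypothetical signal; each copy has one corrupted
# position, but no two copies are corrupted at the same position.
noisy = ["ACCTTGCA", "ACGATGCA", "ACGTTGGA", "TCGTTGCA"]
print(column_consensus(noisy))
```

Each individual string here contains an error, yet the shared signal is recovered because the errors fall in different columns. Real comparative data is rarely pre-aligned, which is why alignment itself becomes the central problem.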
Similar resources
A Lossy Data Compression Based on String Matching: Preliminary Analysis and Suboptimal Algorithms
A practical suboptimal algorithm (source coding) for lossy (non-faithful) data compression is discussed. This scheme is based on an approximate string matching, and it naturally extends lossless (faithful) Lempel-Ziv data compression scheme. The construction of the algorithm is based on a careful probabilistic analysis of an approximate string matching problem that is of its own interest. This ...
A New Probabilistic Plan Recognition Algorithm Based on String Rewriting
This document formalizes and discusses the implementation of a new, more efficient probabilistic plan recognition algorithm called Yet Another Probabilistic Plan Recognizer, (Yappr). Yappr is based on weighted model counting, building its models using string rewriting rather than tree adjunction or other tree building methods used in previous work. Since model construction is often the most com...
Probabilistic Modeling of Data Structures
Professor Arne Andersson's Letter-to-the-Editor concerning our paper "On the Balance Property of Patricia Tries: External Path Length Viewpoint," Theor. Comp. Sci., 68, 1989, motivated us to present some thoughts about probabilistic analysis of data structures on words. The intention of this note is to discuss potential advantages and disadvantages of probabilistic analyses, and in particular to ...
Finding the Most Probable String and the Consensus String: an Algorithmic Study
The problem of finding the most probable string for a distribution generated by a weighted finite automaton or a probabilistic grammar is related to a number of important questions: computing the distance between two distributions or finding the best translation (the most probable one) given a probabilistic finite state transducer. The problem is undecidable with general weights and is NP-hard ...
Maximal Derivations for Probabilistic Strings in Stochastic Languages
A probabilistic string is a sequence of probability vectors. Each vector specifies a probability distribution over the possible symbols at its location in the string. In a probabilistic grammar a probability is assigned to every derivation. Given a probabilistic string and a probabilistic grammar the concept of a maximal derivation is defined. Algorithms for finding the maximal derivation for p...
Publication date: 2005